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(^ Abstract 

(N 

We describe fc-MLE, a fast and efRcient local search algorithm for learning finite statisti- 

r— I cal mixtures of exponential families such as Gaussian mixture models. Mixture models are 

\^ traditionally learned using the expectation-maximization (EM) soft clustering technique that 

^ij monotonically increases the incomplete (expected complete) likelihood. Given prescribed mix- 

(/3 ture weights, the hard clustering fc-MLE algorithm iteratively assigns data to the most likely 

O weighted component and update the component models using Maximum Likelihood Estimators 

(MLEs). Using the duality between exponential families and Bregman divergences, we prove 

T— I that the local convergence of the complete likelihood of fc-MLE follows directly from the con- 

K^ vergence of a dual additively weighted Bregman hard clustering. The inner loop of fc-MLE can 

be implemented using any fc-means heuristic like the celebrated Lloyd's batched or Hartigan's 

greedy swap updates. We then show how to update the mixture weights by minimizing a cross- 

iy-s entropy criterion that implies to update weights by taking the relative proportion of cluster 

^J points, and reiterate the mixture parameter update and mixture weight update processes until 

/--s convergence. Hard EM is interpreted as a special case of fc-MLE when both the component up- 

rvq date and the weight update are performed successively in the inner loop. To initialize fc-MLE, 

T— I we propose k-MLFi++, a careful initialization of fc-MLE guaranteeing probabilistically a global 

^ bound on the best possible complete likelihood. 

^ Keywords: exponential families, mixtures, Bregman divergences, expectation-maximization 

2 (EM), fc-means loss function, Lloyd's fc-means, Hartigan and Wong's fc-means, hard EM, sparse 

EM. 

1 Introduction 

1.1 Statistical mixture models 

A statistical mixture model [M] M ~ ttt, with fc G N weighted components has underlying probability 
distribution: 



'Research performed during the January-June 2011 period. A preliminary shorter version appeared in IEEE 
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^Wip(x|6'i), (1) 



Tn[x\w,b 

i=l 

with w = {wi, ...,Wk) and 6 = {6i,...,6k) denoting the mixture parameters: The Wj's are positive 
weights summing up to one, and the 6*4 's denote the individual component parameters. (Appendix [E| 
summarizes the notations used throughout the paper.) 

Mixture models of d-dimensional GaussianaH are the most often used statistical mixtures [34J. 
In that case, each component distribution A^(/Xi,Sj) is parameterized by a mean vector /ij G M 
and a covariance matrix Sj ;^ that is symmetric and positive definite. That is, 9i = (/Xj, Sj). The 
Gaussian distribution has the following probability density defined on the support X = R*^: 

p(x;/ii,Si) = d ^ e ' ^» '■' ", (2) 

(27r)2^^S^ 



where Mq denotes the squared Mahalanobis distance |12j 

MQix,y) = ix-yfQix-y), (3) 

defined for a symmetric positive definite matrix Q >~ {Qi = S^ , the precision matrix). 

To draw a random variate from a Gaussian mixture model (GMM) with k components, we first 
draw a multinomial variate z £ {1, ..., k}, and then sample a Gaussian variate from N{fj,z, S^). A 
multivariate normal variate x is drawn from the chosen component N{fj,, T,) as follows: First, 
we consider the Cholesky decomposition of the covariance matrix: S = CC , and take a d- 
dimensional vector with coordinates being random standard normal variates: y = [yi ... y^]^ with 
yi = \^—2 log ui cos(27rti2) (for ui and U2 uniform random variates in [0, 1)). Finally, we assemble 
the Gaussian variate x as x = ^ + Cy. This drawing process emphasizes that sampling a statistical 
mixture is a doubly stochastic process by essence: First, we sample a multinomial law for choosing 
the component, and then we sample the variate from the selected component. 

Figure w[h) shows a GMM with k = 32 components learned from a color image modeled as a 
5D xyRGB point set (Figure [I[ a)). Since a GMM is a generative model, we can sample the GMM 
to create a "sample image" as shown in Figure [I[c). Observe that low frequency information of the 
image is nicely modeled by GMMs. Figure pf f ) shows a GMM with k = 32 components learned 
from a color image modeled as a high-dimensional point set. Each sx s color image patch anchored 
at (x,y) is modeled as a point in dimension d = 2 + 3s^. GMM representations of images and 
videos \Tr\ provide a compact feature representation that can be used in many applications, like in 
information retrieval (IR) engines |14| . 

In this paper, we consider the general case of mixtures of distributions belonging the same 
exponential family |50j . like Gaussian mixture models [23] (GMMs), Rayleigh mixture models |47j 
(RMMs), Laplacian mixture models (LMMs)[5], Bernoulli mixture models [5] (BMMs), Multinomial 
Mixture models [46j (MMMs), Poisson Mixture Models (PMMs) L28J, WeibuU Mixture Models [15] 
(WeiMMs), Wishart Mixture Models [22] (WisMM), etc. 

1.2 Contributions and prior work 

Expectation-Maximization [18] (EM) is a traditional algorithm for learning finite mixtures |34j . 
Banerjee et al. [9] proved that EM for mixture of exponential families amounts to perform equiv- 



^Also called Multivariate Normals (MVNs) in software packages. 





Figure 1: A RGB color image (a) is interpreted as a 5D xyRGB point set on which a Gaussian 
mixture model (GMM) with k = 32 components is trained (b). Drawing many random variates 
from the generative GMM yields a sample image(c) that keeps low-frequency visual information. 




(d) 



(e) 



(f) 



Figure 2: Modeling a color image using a Gaussian mixture model (GMM): (a) Image Baboon source 
image, (b) a 5D 32-GMM modeling depicted by its covariance ellipses, (c) hard segmentation using 
the GMM, (d) samphng the 5D GMM, (e) Mean colors (8x8 patches) for GMM with patch size 
s = 8, (f) patch mean /u for s = 8 patch size width. 



alently a soft Bregman clustering. Furthermore, this EM-Bregman soft clustering equivalence was 
extended to total Bregman soft clustering for curved exponential families |29j . Although math- 
ematically convenient, we should remember that mixture data should be hard clustered as each 
observation should emanate from exactly one component. 

It is well-known that A;-means clustering technique can be interpreted as a limit case of EM for 
isotropic Gaussian mixtures [37] . Kearns et al. [2E] casted further light on the hard/soft relationship 
using an information-theoretic analysis of hard fe-means and soft expectation-mazimization assign- 
ments in clustering. Banerjee et al proved a mathematical equivalence between the estimation 
of maximum likelihood of exponential family mixtures (MLME, Maximum Likelihood Mixture Es- 
timation) and a rate distortion problem for Bregman divergences. Furthermore, Banerjee et al. [8] 
proposed the hardened expectation for the special case of von Mises-Fisher mixtures (hard EM, 
Section 4.2 of [8j) for computational efficiency. 

In this paper, we build on the duality between Bregman divergences and exponential families |9] 
to design A;-MLE that iteratively (1) assigns data to mixture components, (2) update mixture 
parameters a la fe-means and repeat step (1) until local convergence, (3) update weights and reiterate 
from (1) until local convergence (see Algorithm [I]). We prove that fe-MLE maximizes monotonically 
the complete likelihood function. We also discuss several initialization strategies and describe a 
probabilistic initialization fc-MLE+-|- with guaranteed performance bounds. 

The paper is organized as follows: Section [2] recall the basic notions of exponential families, Leg- 
endre transform, Bregman divergences, and demonstrate the duality between Bregman divergences 
and exponential families to study the Maximum Likelihood Estimator (MLE). Section ^ presents 
the framework of A:-MLE for mixtures with prescribed weights, based on the Bregman-exponential 
family duality. The generic A:-MLE algorithm is described in Section |4j and Section [5] discusses 
on proximity location data-structures to speed up the assignment step of the algorithm. Section [6] 
presents fe-MLE-|— |-, a probabilistic initialization of fc-MLE. Finally, Section [7] concludes the paper 
and discusses on avenues for future research. 

2 Preliminaries 

2.1 Exponential family 

An exponential family [13] Ep is a set of parametric probability distributions 

EF = {PF{x;e)\e£e} (4) 

whose probability densitjjj can be decomposed canonically as 

p^(x;0) = e<*(-')'^>-^(^)+'=(^) (5) 

where t{x) denotes the sufficient statistics, 6 the natural parameter, F(0) the log-normalizer, 
and k(x) a term related to an optional auxiliary carrier measure, {x, y) denotes the inner product 
(i.e., x^y for vectors ii{X'^Y) for matrices, etc.). Let 



Q = \e\ I pf{x] e)dx<oo\ 



(6) 



•^For sake of simplicity and brevity, we consider without loss of generality in the remainder continuous random 
variables on R'*. We do not introduce the framework of probability measures nor Radon-Nikodym densities. 



denotes the natural parameter space. The dimension D of the natural parameter space is called the 
order of the family. For the d-variate Gaussian distribution, the order is D = d-\ — ^ — - = -^ — -. 
It can be proved using the Cauchy-Schwarz inequality J13j that the log-normalizein F is a strictly 
convex and differentiable function on an open convex set 0. The log-density of an exponential 
family is 

lF{x-9) = {t{x),e)-F{e) + k{x) (7) 

To build an exponential family, we need to choose a basic density measure on a support X, 
a sufficient statistic t(x), and an auxiliary carrier measure term k{x). Taking the log-Laplace 
transform, we get 

F{e) = I e^^^^^'^'^+'^^^^dx, (8) 

A-6X 

and define the natural parameter space as the 9 values ensuring convergence of the integral. 

In fact, many usual statistical distributions such as the Gaussian, Gamma, Beta, Dirichlet, Pois- 
son, multinomial, Bernoulli, von Mises-Fisher, Wishart, WeibuU are exponential families in disguise. 
In that case, we start from their probability density or mass function to retrieve the canonical de- 
composition of Eq. pi See |36| for usual canonical decomposition examples of some distributions 
that includes a bijective conversion function 6{\) for going from the usual A-parameterization of 
the distribution to the 0-parametrization. 

Furthermore, exponential families can be parameterized canonically either using the natural 
coordinate system 6, or by using the dual moment parameterization r] (also called mean value 
parameterization) arising from the Legendre transform (see Appendix [B] for the case of Gaussians). 

2.2 Legendre duality and convex conjugates 

For a strictly convex and differentiable function F : IN — )• M, we define its convex conjugate by 

F*(ry)=sup{(r?,0)-F(e)} (9) 

The maximum is obtained for 77 = \7F{9) and is unique since F is convex VgZi?(T/;0) = 
-V^F{e) -< 0: 

\/glp(n;e) = r]-VF{9) = 0^r] = VF{e) (10) 

Thus strictly convex and differentiable functions come in pairs {F, F* ) with gradients being 
functional inverses of each other VF = (VF*)^^ and \7F* = (VF)^^. Legendre transform is an 
involution: (F*)* = F for strictly convex and differentiable functions. In order to compute F* , we 
only need to find the functional inverse (VF)"^ of VF since 

F*{rj) = ((VF)-i(r7),r?) -F((VF)-i(77)). (11) 

However, this inversion may require numerical solving when no analytical expression of VF^^ 
is available. See for example the gradient of the log-normalizer of the Gamma distribution |36j, the 
Dirichlet or von Mises-Fisher distributions [8]. 



^Also called in the literature as the log-partition function, the cumulant function, or the log-Laplace function. 



2.3 Bregman divergence 

A Bregman divergence Bp is defined for a strictly convex and difFerentiable generator F as 

BpiOi : 62) = F{6i) - F{92) - {9i - 62, VF(^2)). (12) 

The Kullback-Leibler divergence (relative entropy) between two members pi = pf{x;6i) and 
P2 = Pf{x'-,02) of the same exponential family amounts to compute a Bregman divergence on the 
corresponding swapped natural parameters: 

KL{pi:p2) = I pi{x)\og^—-dx, (13) 

Aex P2{x) 

= BFie2:0i), (14) 

= F{92)-F{ei)-{e2-ei,vF{ei)) (15) 

The proof follows from the fact that E[t{X)] = f^^^t{x)pF{x;e)dx = VF{0) [39]. Using 
Legendre transform, we further have the following equivalences of the relative entropy: 

BFie2:ei) = BF,irji-V2), (16) 

= F{92)+F*{rii)-{e2,Vi), (17) 

where t] = 'VF{9) is the dual moment parameter (and 9 = 'VF*(rj)). Information geometry [3] 



often considers the canonical divergence Cf of Eq. 17 that uses the mixed coordinate systems 9/ri, 
while computational geometry pL2j tends to consider dual Bregman divergences, Bf or Bp*, and 
visualize structures in one of those two canonical coordinate systems. Those canonical coordinate 
systems are dually orthogonal since V^F(0)V^-F*(ry) = /, the identity matrix. 

2.4 Maximum Likelihood Estimator (MLE) 

For exponential family mixtures with a single component M ~ Ef{Oi) {k = 1, u;i = 1), we easily 
estimate the parameter 9i. Given n independent and identically distributed observations xi, ...,Xn, 
the Maximum Likelihood Estimator (MLE) is maximizing the likelihood function: 

§ = argmaxgg0L(6';xi,...,x„), (18) 

n 

= argmaxgge J|pi7'(xj;6'), (19) 

= argmax.gee^^-^^*^"^^''^"''^'^^'^"^^ (20) 

For exponential families, the MLE reports a unique maximum since the Hessian of F is positive 
definite {X ~ Epie) => V^F = var[t(X)] ^ 0): 

1 " 

n . 
1=1 



Exponential Family 


<^ 


Dual Bregman divergence 


Pf{x\9) 




Bf* 


Spherical Gaussian 


<^ 


Squared Euclidean divergence 


Multinomial 


<^ 


Kullback-Leibler divergence 


Poisson 


4^ 


/-divergence 


Geometric 


4^ 


Itakura-Saito divergence 


Wishart 


4^ 


log-det/Burg matrix divergence 



Table 1: Some examples illustrating the duality between exponential families and Bregman diver- 
gences. 



The MLE is consistent and efficient with asymptotic normal distribution: 

e^N(e,-ip\9) 

\ n 
where Ip denotes the Fisher information matrix: 



(22) 



Ipie) = vai[t{X)] = V'^F{e) = (V2G(r/))-i (23) 

(This proves the convexity of F since the covariance matrix is necessarily positive definite.) Note 
that the MLE may be biased (for example, normal distributions). 

By using the Legendre transform, the log-density of an exponential family can be interpreted 
as a Bregman divergence [9]: 



logpF{x;9) = -BF*{t{x) ■.i]) + F*{t{x)) + k{x) 



(24) 



Table [T] reports some illustrating examples of the Bregman divergence f-)- exponential family 
duality. Let us use the Bregman divergence-exponential family duality to prove that 



6 = argmax\\ ppixi; 6) = VF ^ / i(a 
6*6© -'■-'- \ ^-^ 

Maximizing the average log-likelihood I = - logL, we have: 



(25) 



1 " 
max0g]N l{0; xi, ..., x„) = - V'((t(xj), 6*) - F{9) + k{xi)) 

n ^-^ 
1=1 

1 " 
maxeeiN - T^ -Bp* (t(xi) : rj) + F* {t{xi)) + k{xi) 

i=l 
1 " 

= min.^gM - ^ Bf* it{xi) : rj) 



(26) 

(27) 
(28) 



i=l 



Since right-sided Bregman centroids defined as the minimum average divergence minimizers 
coincide always with the center of mass [9] (independent of the generator F) , it follows that 



T] 



1 " 

-^tix,) = VF{e). 



(29) 



i=l 



It follows that 7) = (VF)-i(^ ^"^^ t(xi)). 

In information geometry [3], the point P with ry-coordinate fj (and ^-coordinate \7F~ (fj) = 0) 
is called the observed point. The best average log-likelihood reached by the MLE at i) is 

1 " 

/(^;xi,...,x„) = -y^{-BF*{t{xi):fi)+F*{t{xi))+k{xi)), (30) 

i=l 

1 " 

= -Y.^-F*{t{x,)) + F*{fj) + {t{x,)-f,,VF*{f^))+F*{t{x,)) + k{x,)), (31) 

n /l " \ 

= F*(^) + -j;fc(x0 + (-5]t(x.)-r},^), (32) 



n ■' — ' \ n 

1=1 \ i=l 



F*{f,) + -Y,k{x,). (33) 



n . 
1=1 



The Shannon entropy Hf{9) oi pf{x;9) is Hf{9) = —F*{'q) — f k{x)pF{x;6)dx [39]. Thus 
the maximal likelihood is related to the minimum entropy (i.e., reducing the uncertainty) of the 
empirical distribution. 

Another proof follows from the Appendix [A| where it is recalled that the Bregman information [9] 
(minimum of average right-centered Bregman divergence) obtained for the center of mass is a Jensen 
diversity index. Thus we have 

I = -J^.(^i(x,)) + ^X^F*(t(x,)) + ^f;A:(x,), (34) 

i=l i=l i=l 

Y^F*{t{x.))-F*{f,)]+-Y,F*itix^)) + -Y,Hx^), (35) 

i=l / j=l j=l 

1 " 

= F*{r^) + -Y,Hx^) (36) 



n 
%=\ 



Appendix |B] reports the dual canonical parameterizations of the multivariate Gaussian distri- 
bution family. 

3 /c-MLE: Learning mixtures with given prescribed weights 

Let X = {xi, ..., x„} be a sample set of independently and identically distributed observations from 
a finite mixture m{x\w,9) with k components. The joint probability distribution of the observed 
observations Xj's with the missing component labels Zj's is 

n 

p{xi,zi,...,Xn,Zn\w,9) = ^p{zi\w)p{xi\zi,9) (37) 

i=l 



To optimize the joint distribution, we could test (theoretically) all the k"^ labels, and choose the 
best assignment. This is not tractable in practice since it is exponential in n for k > 1. Since we 
do not observe the latent variables zi, ..., z„, we marginalize the hidden variables to get 

n k 

p{xi,...,Xn\w,9) = Yi'^pizi = j\w)p{xi\zi = j,6j) (38) 

The average log-likelihood function is 

l{xi,...,Xn\w,9) = -logp{xi,...,Xn\w,e), (39) 

n 



^71 k 

-'^^'^s'^Pizi = j\w)p{xi\zi =j,9j). (40) 



n 



Let bji^Zi) = 1 if and only if Xi has been sampled from the jth component, and otherwise. We 
have the complete average log-likelihood that is mathematically rewritten as 

^ n k 

l{xx,Zx,...,Xn,Zn\w,d) = - ^ log JJ(WjPF(a;i|^i))'^^^^'^ (41) 

'^ i=l j=l 

^ n k 



i=i j=i 



Using the bijection between exponential families and dual Bregman divergences [9j, we have the 
mathematical equivalence logpF{x\6j) = —Bp*{t{x) : rjj) + F*{t{x)) + k{x), where rjj = VF{6j) 
is the moment parameterization of the j-th component exponential family distribution. It follows 
that the complete average log-likelihood function is written as 



1 

l{xi,...,Xn\w,9) = -'^'^6j{zi){-BF*{t{xi):7]j) + F*{t{xi)) + k{x^)+logWj) (43) 

(^ n k \ 1 '^ 

-^^SJ{z^){-BF*{tixi) : Vj) + ^ogWj) +-Y,F*{t{xi)) + k{xi)iA) 
^ i=l j=l J "^ i=\ 

By removing the constant terms - X]"=i(-^*(i(a;i)) ^k{xi)) independent of the mixture moment 
parameters (the r/'s), maximizing the complete average log-likelihood amounts to equivalently min- 
imize the following loss function: 



n k 



~^' = ^EE^i(^^)(^^*W^^)^^^)-l°g^^)' (45) 

'^ i=i j=i 

1^ k 
= - V min(5F* iVi ■rij)- log Wj), (46) 

= kmeansF*,iogu>(>' : H), (47) 



where y = {yi = t{xi),...,yn = t{xn{} and H = {r/i,. ..,%}. 

Remark 1 This is the argmin of Eq. [7^ that gives the hidden component labels for the Xi 's. 

Remark 2 Observe that since \/i G {1, ...,k}, — logWi > (since Wi < 1), we have the following 
additive dual Bregman divergence BF*{yi : r]j) — log Wj > per cluster. Depending on the weights 
(e.g., w — )• 0), we may have some empty clusters. In that case, the weight of a cluster is set to 
zero (and the component parameter is set to by convention). Note that it makes sense to consider 
(< k) -means instead of k -means in the sense that we would rather like to upper bound the maximum 
complexity of the model rather than precisely fixing it. 



Eq. 46 is precisely the loss function of a per-cluster additive Bregman fc-means (see the ap- 
pendix |A[ ) defined for the Legendre convex conjugate F* of the log-normalizer F of the exponen- 
tial family for the sufficient statistic points y = {yi = t{xi)}f^i. It follows that any Bregman 
/j-means heuristic decreases monotonically the loss function and reaches a local minimum (cor- 
responding to a local maximum for the equivalent complete likelihood function). We can either 
use the batched Bregman Lloyd's /c-means [9], the Bregman Hartigan and Wong's greedy cluster 
swap heuristic |231l52j. or the Kanungo et al. |25] (9 + e)-approximation global swap approximation 
algorithm. 

Remark 3 The likelihood function L is equal to e . The average likelihood function L is defined 
by taking the geometric mean L = L". 

The following section shows how to update the weights once the local convergence of the 
assignment-?] of the /c-MLE loop has been reached. 

4 General /c-MLE including mixture weight updates 



When A;-MLE with prescribed weights reaches a local minimum (see Eq. W4l and Eq. 46 and the 
appendix |A|), the current loss function is equal to 



n k 



^ EE'^^(^*)(^^*(*(^^) • ^j) - log^^) - ^ E^*(^(^^)) + ^(^*) K^8) 



n ^ — ' ^ — ' \ n 

1=1 j=l \ i=l 



Minimized by additive Bregman fc-means, see Appendix 

k /^ n \ 



j=i \ i=i 



where ai = -^ denotes the proportion of points assigned to the i-th cluster Cj, and aiJF*{Ci) is 
the weighted Jensen diversity divergence of the cluster. In order to further minimize the average 



complete likelihood of Eq. 49, we update the mixture weights Wi^s by minimizing the criterion: 



k 



min y. ~'^j log Wj (50) 

min H^{a : w), (51) 
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where H^{p : q) = —J2i=iPi^°SQi denotes the Shannon cross-entropy, and A^ the (A; — 1)- 
dimensional probabihty simplex. The cross-entropy H^ (p : q) is minimized for p = q, and yields 
H^{p,p) = H{p) = — X]i=iPilogpi, the Shannon entropy. Thus we update the weights by taking 
the relative proportion of points falling into the clusters: 

yie {l,...,k],Wi-(^ ai. (52) 

After updated the weights, the average complete log-likelihood is 

k 



J=Y,WiJF*{Ci)+H{w)- (i^F*(t(x,)) + A:(xi)] . (53) 

We summarize the A;-MLE algorithm in the boxed Algorithm [T] 

Algorithm 1 Generic fc-MLE for learning an exponential family mixture model. 
Input: 



X 
F 

VF 
VF'^ 
t{x) 
k 



a set of n identically and independently distributed observations: X = {xi, ..., x„} 

log-normalizer of the exponential family, characterizing Ep 

gradient of F for moment T/-parameterization: r] = 'VF{6) 

functional inverse of the gradient of F for ^-parameterization: 6 = VF^^{r]) 

the sufficient statistic of the exponential family 

number of clusters 



• 0. Initialization: Vi G {1, ..., A;}, let Wi = -/: and r/j = t(xj) 
(Proper initialization is further discussed later on). 

• 1. Assignment: Vi G {1, ...,n},Zi = SiXgTaiTC^'^Bp*{t{xi) : rjj) — logWj. 
Let Vi G {1, ..., k} Ci = {xj\zj = i} be the cluster partition: X = U^^^Ci. 
(some clusters may become empty depending on the weight distribution) 

• 2. Update the r/-parameters: \/i G {1, ...,k},rji = j^Ylx&c^^^)- 

(By convention, r/j = if \Ci\ = 0) Goto step 1 unless local convergence of the complete 
likelihood is reached. 

• 3. Update the mixture weights: Vi G {1, ...,k},Wi = ^|Cj|. 

Goto step 1 unless local convergence of the complete likelihood is reached. 

Output: An exponential family mixture model m{x) (EFMM) parameterized in the natural coor- 
dinate system: Vi G {1, ..., A;},6'i = (VF)-i(r?i) = VF*{r]i): 

k 

m{x) = ^WiPF{x\6i) 
i=l 



Remark 4 Note that we can also do after the assignment step of data to clusters both (i) the 
mixture rj-parameter update and (ii) the mixture w-weight update consecutively in a single iteration 
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of the k-MLE loop. This corresponds to the Bregman hard expectation-maximization (Bregman 
Hard EM) algorithm described in boxed Algorithm^ This Hard EM algorithm is straightforwardly 
implemented in legacy source codes by hardening the weight membership in the E-step of the EM. 
Hard EM was shown computationally efficient when learning mixtures of von-Mises Fisher (vMF) 
distributions ^8]. Indeed, the log-normalizer F (used when computing densities) of vMF distribu- 
tions requires to compute a modified Bessel function of the first kind 14 i^ , that is only invertible 
approximately using numerical schemes. 

Algorithm 2 Hard EM for learning an exponential family mixture model. 

• 0. Initialization: \/i G {1, ..., k}, let Wi = -^ and rn = t{xi) 
(Proper initialization is further discussed later on). 

• 1. Assignment: Vi G {1, ...,n},Zi = ar grain j^iBF*{t{xi) : rjj) — logWj. 
Let Vi G {1, ..., k} Ci = {xj\zj = i} be the cluster partition: X = U^^iCi. 

• 2. Update the r/-parameters: Vi G {1, ...,k},r]i = w~\Ylx(^c^i^)- 

• 3. Update the mixture weights: Vi G {l,...,k},Wi = —. 

• Goto step 1 unless local convergence of the complete likelihood is reached. 



We can also sparsify EM by truncating to the first D entries on each row (thus, we obtain 
a well-defined centroid per cluster for non-degenerate input). This is related to the sparse EM 
proposed in |35j . Degeneraties of the EM GMM is identified and discussed in [6j. Asymptotic 
convergence rate of the EM GMM is analyzed in [35] . 

There are many ways to initialize A:-means [^. Initialization shall be discussed in Section pi 



5 Speeding up /c-MLE and Hard EM using Bregman NN queries 

The proximity cells {Vi, ..., Vfc} induced by the cluster centers C = {ci, ..., Cfc} (in the r/-coordinate 
system) are defined by: 

Vj = {x£X\ BF*{t{x) : rij)-\ogWj < BF*{t{x) : r?^ ) - log it;^ , V/ G {1, ...,k}\{j}} (54) 

partitions the support X into a Voronoi diagram. It is precisely equivalent to the intersection 
of a Bregman Voronoi diagram for the dual log-normalizer F* with additive weights [12] on the 
expectation parameter space WL = {rj = VF{9) \ G M} with the hypersurfac^T = {t{x) \ x G X}. 
For the case of Gaussian mixtures, the log-density of the joint distribution Wippix; fii,T,i) induces a 
partition of the space into an anisotropic weighted Voronoi diagram [27] . This is easily understood 
by taking minus the log-density of the Gaussian distribution (see Eq. ^: 

- logp{x; Hi, Sj) = -D^-i (x - /ij, X - /ij) + - log |Si| + - log 27r, (55) 



Note that there is only one global minimum for the distance Bf- {y : r/) with y G T. 
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Figure 3: From the source color image (a), we buid a 5D GMM with k = 32 components, and color 
each pixel with the mean color of the anisotropic Voronoi cell it belongs to. 

with Mq the squared Mahalanobis distance MQ[x,y) = {x — y) Q{x — y). This is an additively 
weighted Bregman divergence with mass rrii = ^ log |Sj| + ^ log 27r and generator Fi{x) = {x, S~ x), 
the precision matrix (see the Appendix) . Figure p^ displays the anisotropic Voronoi diagram |27j 
of a 5D xyRGB GMM restricted to the xy plane. We color each pixel with the mean color of the 
anisotropic Voronoi cell it belongs to. 

When the order of the exponential family (i.e., number of parameters) is small (say, -D < 3), 
we can compute explicitly this additively weighted Bregman Voronoi diagrams in the moment 
parameter space M, and use proximity location data-structures designed for geometric partitions 
bounded by planar walls. Otherwise, we speed up the assignment step of A;-MLE/Hard EM by 
using proximity location data-structures such as Bregman ball trees [l5] or Bregman vantage point 
trees |40j. See also jl]. 

Besides Lloyd's batched /c-means heuristic |3H I33 | [T9]. we can also implement other /c- means 
heuristic like the greedy Hartigan and Wong's swap [23\ |52] in fe-MLE that selects a point and 
optimally reassign it, or Kanungo et al. [25j global swap optimization, etc. 

Remark 5 The MLE equation fj = VF{9) = - X^ILi ^(^«) '^o?/ yield a transcendental equation. 
That is, when (VF)^^ is not available analytically (e.g., von Mises- Fisher family JB^), the convex 
conjutate F* needs to be approximated by computing numerically the reciprocal gradient VF~^ 
(see Eq. 11). Sra 14^1 focuses on solving efficiently the MLE equatioTrl for the von Mises-Fisher 
distributions. 



'See also, software R package movMF 
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6 Initializing A;-MLE using A;-MLE++ 

To complete the description of /c-MLE of boxed Algorithm 1, it remains the problem to properly 
initializing A;-MLE (step 0). One way to perform this initialization is to compute the global MLE 
parameter for the full set X: 



^ = VF-M-Vt(x,) , (56) 




and then consider the restricted exponential family of order d < D with restricted sufficient statistic 
the first d components of full family statistic (ti(x), ...,id(x)). We initialize the i-th cluster with 
r^i = (ti(xi), ...,trf(xj), 17(^+1, ...,t}d). For the case of multivariate Gaussians with D = -^ — '-, this 
amounts to compute the covariance matrix S of the full set and then set the translation parameter to 
Xi'- Vi — (^i) "2^ ~'~'^*'^r)) (^®^ appendix B). This initialization is a heuristic with no guaranteed 
performance on the initial average complete log- likelihood / compared to the best one /* . Note that 
when D = d (e.g., Poisson, Weibull, Rayleigh, isotropic Gaussian, etc.), we need to have distinct 
initializations so that instead of taking the global MLE, we rather split the data set into k groups 
of size ^, and take the MLE of each group for initialization. A good geometric split is given by 
using a Voronoi partition diagram as follows: We run Bregman fc-means on y for the dual convex 
conjugate F* and set the mixture parameters as the MLEs of clusters and the weights as the relative 
proportion of data in clusters. This corroborates an experimental observation by Banerjee et al. [9] 
that observes that clustering works experimentally best if we choose the dual Bregman divergence 
associated with the exponential family mixture sample set. 

Let us further use the dual Bregman /c-means interpretation of EM to perform this initialization 
efficiently. Assume uniform weighting of the mixtures. That is, Vi G {1, ...,k^,Wi = t- 



Maximizing the average complete log-likelihood amounts to minimize (see Eq. 46): 

1 ^ fc 
/" = - > mmBp-'iyi = t{xi) : rjj). (57) 

1=1 

The likelihood function L(xi, ...,Xn\0-,w) is 



j^ _ -nkraeansp, {C)+n\ogk+Y^"^-^^{F* {xi)+k{xi)) /ro-j 

Thus for uniform mixture weights, the ratio between two different /c-means optimization with 
respective cluster centers C and C is: 

— n(kmeansj7» (C)— kmeanspi* (C)) (^Ci\ 

We can use the standard Bregman A;-means++ initialization [2j on the convex conjugate F* 
that gives probabilistically a guaranteed 0(/i~^ log fc) performance, where /x is a constant factor to 
be explained below. The Bregman A;-means-|— |- algorithm is recalled in boxed Algorithm [3j 

Let kmeans^ denote the optimal Bregman /c-means average loss function for generator F . Breg- 
man /c-means-l— |- described in Algorithm ^ ensures that 

kmeansF*(3^ : C) < kmeansF(3^ : C) < ^(2 -F log /c)kmeansF* (3^ : C) (60) 

/^ 
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Algorithm 3 Bregnian A;-means++: probabilistically guarantees a good initialization. 

• Choose first seed C = {yi}, for / uniformly random in {1, ...,n}. 

• For i ■(— 2 to /c 

— Choose Ci € {yi, ■■■,yn} with probability 

BFic^■.C) ^ BFJy-.C) 

Y17=i BpiVi ■ C) kmeansF(3^ : C) ' 
where Bp{c : C) = miupgc Bf{c : p). 

— Add selected seed to the initialization seed set: C ^ Cu{cj}, and reiterate until \C\ = k. 



The factor fj, in the upper bound is related to the notion of ^u-similarity that we now concisely 
explain. Observe that the squared Mahalanobis distance MQ{p,q) = {p — q)^Q{p — q) satisfies the 
double triangle inequality: 

MQ{p,q)<2{MQ{p,r) + MQ{r,q)). (61) 

A Bregman divergence is said to have the /i-similarity on a domain y if there exists a positive 
definite matrix Q y on y = conv(2/i, .■.,yn) and a real < ^ < 1 such that 

fiMg {p, q) <BF{p:q)< Mq {p,q) (62) 

Since a Bregman divergence can also be interpreted as the remainder of a Taylor expansion 
using the Lagrange error term: 

Bf{p ■.q) = {p- qf^^^^ip - q), (63) 

with epq being a point on the line segment \pq]. It follows that by considering the Hessian V^-F on 
a compact subset y = conv(yi, ...,yn), we get a bound [41j for /i as follows: 

P,g&ymeiXy(.y{p-q)TV^F{y){p-q) 

By considering a hyperrectangle bounding the convex hull 3^ = conv(yi, ...,y„), it is usually 
easy to compute bounds for /j,. See [2j for some examples. 

The notion of /^-similarity also allows one to design fast proximity queries [T] based on the 
following two properties: 

Approximately symmetric. 

BF{p:q)<-BF{q,p) (65) 

1^ 



Deficient triangle inequality. 



Bf{p : q) < -{Bf{p : r) + S^(g : r)) (66) 
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For mixtures with prescribed but different non-zero weighting, we can bound the hkehhood 
ratio using w'^ = maxju^j > ^ and w~ = vamiWi. When mixture weights are unknown, we 
can further discretize weights by increments of size 6 {0{l/5^) such weight combinations, where 
each combination gives rise to a fixed weighting) and choose the initiahzation that yields the best 
hkehhood. 

7 Concluding remarks and discussion 

Banerjee et al. [9] proved that EM for learning exponential family mixtures amount to perform 
a dual Bregman soft clustering. Based on the duality between exponential families and Bregman 
divergences, we proposed fc-MLE, a Bregman hard clustering in disguise. While A:-MLE decreases 
monotonically the complete likelihood until it converges to a local minimum after a finite number of 
steps, EM monotonically decreases the expected complete likelihood and requires necessarily a pre- 
scribed stopping criterion. Because fc-MLE uses hard membership of observations, it fits the doubly 
stochastic process of sampling mixtures (for which soft EM brings mathematical convenience) . 

Both /c-MLE and EM are local search algorithm that requires to properly initialize the mixture 
parameters. We described /c-MLE++, a simple initialization procedure that builds on Bregman 
A;-means++ [2j to probabilistically guarantee an initialization not too far from the global optimum 
(in case of known weights). While we use Lloyd A;-means [31] heuristic for minimizing the fe- means 
loss, we can also choose other fc-means heuristic to design a corresponding /c-MLE. One possible 
choice is Hartigan's greedy swap [52j that can further improve the loss function when Lloyd's k- 
means is trapped into a local minimum. A local search technique such as Kanungo et al. swap |25j 
also guarantees a global (9 + e)-approximation. 

The MLE may yield degenerate situations when, say, one observation point is assigned to one 
component with weight close to one. For example, the MLE of one point for the normal distribution 
is degenerate as a — )• (and w — )• 1)), and the likelihood function tends to infinity. That is the 
unboundedness drawback of the MLE. See \AS\ [TT] for further discussions on this topic including a 
penalization of the MLE to ensure boundedness. 

Statistical mixtures with k components are generative models of overall complexity k — 1 + 
kD, where D is the order of the exponential family. An interesting future direction would be to 
compare mixture models versus a single multi-modal exponential family [16] (with implicit log- 
normalizer F). We did not address the model selection problem that consists in determining the 
appropriate number of components, nor the type of distribution family. Although there exists many 
criteria like the Akaike Information Criterion (AIC), model selection is a difficult problem since 
some distributions exhibit the indivisibility property that makes the selection process unstable. 
For example, a normal distribution can be interpreted as a sum of normal distributions: V/c € 
N, N{/j,a^) = J2i=i-^ ik^ir)- From the practical point of view, it is better to overestimate 
k, and then perform mixture simplification using entropic clustering [20]. Belkin and Sinha |10j 
studied the polynomial complexity of learning a Gaussian mixture model. 

We conclude by mentioning that it is still an active research topic to find good GMM learning 
algorithms in practice (e.g., see the recent entropy-based algorithm 
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A /c-Means with per-cluster additively weighted Bregman diver- 
gence 

/c-Means clustering asks to minimize the cost function kmeans(A' : C) by partitioning input set 
X = {xi, ..., Xn} into k clusters using centers C = {ci, ..., Cfc}, where 

kmeans(^ : C) = — > min \\xi — Cj\\ . (67) 



There are several popular heuristics to minimize Eq. 67 like Lloyd's batched method [30] or 
Hartigan and Wong's swap technique |23j . Those iterative heuristics guarantee to decrease mono- 
tonically the A;- means loss but can be trapped into a local minimum. In fact, solving for the 
global minimum kmeans*(Af : C) is NP-hard for general k (even on the plane) and for k = 2 and 
arbitrary dimension of datasets. Kanungo et al. |25] swap optimization technique guarantees a 
(9 + e)-approximation factor, for any e > 0. 

Let us consider an additively weighted Bregman divergence Bp.^rui per cluster as follows: 

Bp^m, {p-q)= Bp^ {p: q)+mi, (68) 



with rrii denoting the additive mass attached to a cluster centeirl and Bp^ the Bregman divergence 
induced by the Bregman generator Fj defined by 

Bp.ip : q) = F,{p) - F,{q) - {p - q,VFi{q)), (69) 

Remark 6 For k-MLE, we consider all component distributions of the same exponential family 
Ep, and therefore all Fi = F* 's are identical. We could have also considered different expo- 
nential families for the components hut this would have burdened the paper with additional no- 
tations although it is of practical interest. For example, for the case of the multivariate Gaus- 
sian family, we can split the vector parameter part from the matrix parameter part, and write 
F{6vi,9Mi) = F0^,j.{9.ui) = Fi{6^i). 

Let us extend the Bregman batched Lloyd's /c-means clustering [5] by considering the generalized 
fc-means clustering loss function for a data set y = {yi,...,yn} and a set C of fc cluster centers 
C = {ci,...,Ck}: 

kmeans(3^,C) = min — > min Di(yi : Cj). (70) 

ci,...,cfe n ^ j=i 
1=1 

Let us prove that the center-based Lloyd's /c-means clustering algorithm monotonically decreases 
this loss function, and terminates after a finite number of iterations into a local optimum. 



^In this paper, we have m^ > by choosing rrii = —logWi for Wi < 1, but this is not required. 
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• When k = 1, the minimizer of kmeans(3^, C = {ci}) is the center of mass (always independent 
of the Bregman generator): 



lEy^ = y^ (71) 

and the Bregman information [9j is defined as the minimal 1-means loss function: 



ci 

n 

i=l 



kmeansFi,mi(3^,{ci}) = /fi(3^) (72) 

1 " 

= -VFi(y,)-Fi(y) + mi, (73) 

n ^-^ 

= mi + J^,(3^), (74) 

where y=\ YTi=\ Vi and 

1 " 

JfAvi. -.Vn) = -y^Fiivi) - Fi{y) > 0, (75) 



n 

i=\ 



denotes the Jensen diversity index |38j 



When /c > 2, let q denote the cluster center of the i-th cluster C\ ay of the partition 

y = uJLj^Q at the t^^ iteration. The generalized additively weighted Bregman /c-means loss 
function can be rewritten as 



k 

kmeansF,m{Ci , ■■■,Cl^' : 4*%..., 4*0 = ^ ^ ^F,,m,(y : q). (76) 



^=1 j;ec(*) 



Since the assignment step allocates yi to their closest cluster center aigmmj^iB F^^miiVi ■ Cj] 
we have 



kmeansi;',m(CJ ,---,C}. ^ : c^ , ..., 4^ < kmeansF,m(Ci , ...,Q . 4 \ ...,4^. (77) 
Since the center relocation minimizes the average additively weighted divergence, we have 

kmeansF,m(Ci ■*,..., Q*^ : cf^ V..,4*^ ) < kmeansir,m(Ci ' ,...,C)^^ ^cfv..,4 )• 

(78) 

By iterating the assignment-relocation steps of fc- means, and cascading the inequalities by 
transitivity, we get 

kmeansF,m(Ci \...,Q*"^ : 4*"^ \---'>Ck ) ^ kmeans^^m (C|*\ . . . , C^*^ : 4 ,"-,4 ) (79) 

Since the loss function is trivially lower bounded by - min^^^ ttt-j (and therefore always positive 
when all rrii > 0), we conclude that the generalized Bregman A:-means converge to a local 
optimum, after a finite numbeijj of iterations. 



^We cannot repeat twice a partition. 
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Furthermore, the loss function can be expressed as 



kmeansF,rn(Ci,...,Cfc : ci,...,Cfc) = - ^ ^ -Bir^,m^(y : q), (80) 

i=i yeCi 
k k 

= ^WiJF,{Ci) + '^Wimi, (81) 

i=l i=l 

With Jp^iCi) = ^EyeC.^i(y) - ^^(c.) > (and a = %|^), and ^«^ = ^ for all i e 
{1, ..., A;}, the cluster relative weights. 

When all Fi are identical to some generator F, we have the following loss function: 

k k 

kmeanSiT'^m = /^ WiJpiCi) + 2, Wiirii (82) 

«=1 i=l 

The celebrated /s-means of Lloyd [3^ minimizes the weighted within-cluster variances (for the 
Bregman quadratic generator F{x) = {x, x) inducing the squared Euclidean distance error) as 
shown in Eq. [STj with Bregman information: 

= E^^2/,y)-2/y,5]^y\-(y-,y-), (86) 

^ V ' 

y 

= T^\J2^y,y)-{y,y) = My), (87) 

the variance. When all cluster generators are identical and have no mass, it is shown by Banerjee 
et al. [9] that the loss function can be equivalently rewritten as: 



kmeansF(P:C) = Jf{V) - Jf{C) = '^WiJF{Ci), (88) 

= If{V)-If{C) (89) 

Remark 7 Note that we always have c = y. That is, the centroid y of set y is equal to the 
barycenter c of the cluster centers C (with weights taken as the relative proportion of points falling 
within the clusters. 
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Remark 8 A multiplicatively weighted Bregman divergence miBp- is mathematically equivalent to 
a Bregm,an divergence B^.p- for generator rmFi, provided that rm > 0. 



As underlined in this proof, Lloyd's fc-means [30] assignment-center relocation loop is a generic 
algorithm that extends to arbitrary divergences Di guaranteeing unique average divergence min- 
imizers, and the assignment/relocation process ensures that the associated /c-means loss function 
decreases monotonically. Teboulle studied [5l] generic center-based clustering optimization meth- 
ods. It is however difficult to reach the global minimum since A;-means is NP-hard, even when data 
set y lies on the plane |53j for arbitrary k. In the worst case, /c-means may take an exponential 
number of iterations to converge [53] , even on the plane. 

B Dual parameterization of the multivariate Gaussian (MVN) 
family 

Let us explicit the dual 0-natural and 7y-moment parameterizations of the family of multivariate 
Gaussians. Consider the multivariate Gaussian probability density parameterized by a mean vector 
Xy = IJ, and a covariance matrix Ajv/ = S. 

p{x;X) = ^ g-|(:.-A„rA-/(:.-A„)^ (9Q^ 

(27r)2y^A^ 



-x'^Xjx + X^Xjx - -X'^iXjXy - - log27r - - log |Am| ) , (91) 



where the usual parameter is A = (A^,AAf) = (/^, S). Using the matrix cyclic trace property 
— gx'^A^ X = tr(— gXX-^A^ ) and the fact that (A^ )^ = A^ , we rewrite the density as follows: 



^M ^^/ "T \ 2^^ '-^M/ ( 2'^'''^*^'" ' 2""°~" ' 2 



p{x; A) = exp (x, A^/A^) + {--xx^, A^/) - -A^A^/A^ + - log 27r + - log | Aa/ | , (92) 



where the inner product of vector is (f i, ^2) = VJV2 and the inner product of matrices is {Mi, M2) 
tr[MfM2). Thus we define the following canonical terms: 



• sufficient statistics: t{x) = {x,—^xx'^) 



auxiliary carrier measure: k{x) = 0, 

natural parameter: 9 = {0^, 9m) = (A^j A^, A^^ ). 

log-normalizer expressed in the A-coordinate system: 

F(A) = \xlXlJXy + - log27r + ^ log |Am| (93) 

Since A^ = 9^j 9^ (and A^ = 9]j9^,j ) and log \Xm\ = — log I^Af I) we express the log-normalizer 
in the ^-coordinate system as follows: 

F{0) = \eleil9y - i log I^A/I + ^ log 27r (94) 
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Since the derivative of the log determinant of a symmetric matrix is V x log |X| = X ^ and the 
derivative of an inverse matrix trace 



Vxtr(^X"^S) = -{X-^BAX- 



1\T 



(95) 



(applied to ^tr(6i^ 0119^) 



-\{6j^j OyO^O^j )), we calculate the gradient VF of the log-normalizer 



as 



with 



Tlv 



VM 



VF{9) = {Vg^F{e),VeMF{0)) 



E[x] = /u, 



2^' 



E 



-XX 



-(/i/x^ + S), 



(96) 



(97) 
(98) 

(99) 
(100) 



where r] = \/F(9) = {t^v^^Im) denotes the dual moment parameterization of the Gaussian. 
It follows that the KuUback-Leibler divergence of two multivariate Gaussians is 



KL(p(x;Ai):p(x;A2)) = 3^(02:91 



(101) 



= ^(tr(S2-iSi)-log|SiS2-i| + (M2-/ii)^S2-i(/i2-/ii)). (102) 
Note that the Kullback-Leibler divergence of multivariate Gaussian distributions [17] can be 



decomposed as the sum of a Burg matrix divergence (Eq. 106) with a squared Mahalanobis distance 



(Eq. 106) (both being Bregman divergences): 



KL{pF{x\fj,i,T.i) : pf{x\^2,^2) 



^ (tr(S2-^Si) - log |SiS2-i| + {^l2 - ^llf^2H^^2 - ml)03) 

1„,„ „ . 1 
2' 



-fi(Si,S2) + -M2-i(^l,/X2), 



with 



B(Si:S2) = tr(SiS2^)-log|SiS^i|-rf, 
Mj.^i{Hl,H2) = (/ii -M2)'^5]2^(/il-M2)• 
To compute the functional inverse of the gradient, we write: 

e = VF-^{r]) = VF*(r/). 
Since tjm = —Hvvvl + ^a/)i '^e have: 



(104) 



(105) 
(106) 



(107) 
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9m = {-2riM - ri^ril)-\ (108) 

9^ = {-2riM-Vvvlr^Vv. (109) 



Finally, we get the Legendre convex conjugate F*[r]) as: 



F*(7?) = (VF*(r/),r/)-F(VF*(7?)), (110) 

= -i^'^og{l + 2rf^r]'^r]^)- -\og\-r]M\- i^'^ogi'Ke). (Ill) 

C A:-MLE for Gaussian Mixture Models (GMMs) 

We explicit fc-MLE for Gaussian mixture models on the usual (//, S) parameters in Algorithm Hi 
The A;-MLE++ initialization for the GMM is reported in Algorithm [5] 

D Rayleigh Mixture Models (RMMs) 

We instantiate the soft Bregman EM, hard EM, /c-MLE, and fe-MLE++ for the Rayleigh distribu- 
tions, a sub-family of WeibuU distributions. 

- "^ 
A Rayleigh distribution has probability density ^e 2^^ where a G M^ denotes the mode of the 

distribution, and x G X = M+ the support. The Rayleigh distributions form a 1-order univariate 

2 

exponential family {D = d = 1). Re-writing the density in the canonical form e~2^ °s^- oga ^ ^^ 
deduce that t(x) =x^,9 = -^, k{x) = logx, and ^(cr^) =loga^ = log-2^ = -log(-26l) = F{9). 
Thus VF{9) = —^ = 7] and F*{r]) = {O^rf) — F{9) = — 1 + log ^. The natural parameter space is 
M = M^ and the moment parameter space is M = M^ (with rj = 2(7^). We check that conjugate 
gradients are reciprocal of each other since VF*{r]) = — = 9, and we have V^F(0)V^G(r/) = 
g^^ = 1 (i.e, dually orthogonal coordinate system) with V'^F{9) = p and V'^F*{ri) = -^. 
Rayleigh mixtures are often used in ultrasound imageries [47] . 

D.l EM as a Soft Bregman clustering algorithm 

Following Banerjee et al. |9], we instantiate the Bregman soft clustering for the convex conju- 
gate F*{ri) = — 1 -|- log|, t{x) = x^ and rj = 2(7^. The Rayleigh density expressed in the rj- 

2 2^2 

parameterization yields p{x; a) = p{x; -q) = -^e i . 
Expectation. Soft membership for all observations xi, ...,Xn'- 



^l<i<n,l<j<k,Wij = ^^P(^^;^j) (112) 

1^1=1 wip{xi;9i) 

(We can use any of the equivalent a, 9 oi rj parameterizations for calculating the densities.) 
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Algorithm 4 k-MLE for learning a GMM. 



Input: 
X : a set of n independent and identically distributed distinct observations: X = {xi, ...,Xn} 
k : number of clusters 

• 0. Initialization: 



— Calculate global mean n and global covariance matrix S: 



n "^-^ 

i=l 
1 '' 

-E 



/i = - yxi, 

n ^-^ 



k 

T --T 

n ^ — ' 



— Vi G {1, ..., A;}, initialize the rth seed as (/ij = Xj, Sj = S). 

1. Assignment: 

Vi G {1, ...,n},Zj = aicgmm'j^iMj.-i{x - iJ.i,x - fii) +\og \T.i\ - 2log Wi 
with My,-i {x — ^i,x — m) the squared Mahalanobis distance: Mq(x, y) = (x — y)^Q{x — y). 

i 

Let Cj = {xj\zj = i}, Vi G {1, ..., k} be the cluster partition: X = U,^^]^Cj. 
(Anisotropic Voronoi diagram [27]) 

2. Update the parameters: 

Vi G {1, ..., k],^li = nrr E ^' ^' = "|(rT E ^^^ ~ ^*^^ 

Goto step 1 unless local convergence of the complete likelihood is reached. 

3. Update the mixture weights: Vi G {1, ..., fc}, Wj = -|Cj|. 

Goto step 1 unless local convergence of the complete likelihood is reached. 
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Algorithm 5 k-MLE for GMM: 



Choose first seed C = {yi}, for I uniformly random in {1, ...,n}. 
For i ■(^ 2 to k 

— Choose Cj = (fii, Sj) with probability 

BF,{c^■.C) ^ Bp^jy-.C) 
Ya=i Bf* iUi ■ C) kmeansF* (3^ : C) ' 
where Bf*{c : V) = min^gp Bp*{c : p). 

F* (^, S) = - ^ log (1 - fi'^ififi'^ + s)- V) - ^ log I^^A^ + S| - ^ log 27T-d 

— Add selected seed to the initialization seed set: C -^ C U {cj}. 



Maximization. Barycenter in the moment parameterization: 

Vl<J<fc, r?, = ^^i7-.^-^(^-) , (113) 

V ^ 1^1=1 ^i,j 

D.2 A;-Maximum Likelihood Estimators 

The associated Bregman divergence for the convex conjugate generator of the Rayleigh distribution 
log-normalizer is 

Bf*{vi-V2) = F*{m)-F*{m)-{Vi-V2,VF*{r]2)), (115) 

= -1 + log— + l-log (??i-??2)(-l/r/2), (116) 

= hlog 1, (117) 

= IS(r/i:7?2) (118) 

This is the Itakura-Saito divergence IS (indeed, F* is equivalent modulo affine terms to — log r], 
the Burg entropy). 

1. Hard assignment. 

yi < i < n, Zi = argmin;^<j<^lS(Xj : rjj) — logWj 

Voronoi partition into clusters: 

VI < i < k,Cj = {xi I lS{xj : rjj) - \ogWj < lS(x^ : ??/) - logw/V/ / j} 
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2. ?7-parameter update. 



VI < J < k, Vj ^ TTTl^ X^ 



VI < J < fc, Gj = \l -r]j 



Go to 1. until (local) convergence is met. 

^veight update. 

VI < i < k, Wj = ^ 
n 

Go to 1. until (local) convergence is met. 

Note that fe-MLE does also model selection as it may decrease the number of clusters in order to 
improve the complete log-likelihood. If initialization is performed using random point and uniform 
weighting, the first iteration ensures that all Voronoi cells are non-empty. 

D.3 A;-MLE++ 

A good initialization for Rayleigh mixture models is done as follows: Compute the order statistics 
for the f , -^, — k~'^^ elements (in overall 0(n log /c)-time). Those pivot elements split the set X 
into k groups Afi, ..., Af^ of size ^, on which we estimate the MLEs. 

The /c-MLE++ initialization is built from the Itakura-Saito divergence: 

IS(?7i : r/2) = h log 1 

k-MLE+-F: 

• Choose first seed C = {yi}, for / uniformly random in {1, ...,n}. 

• For i -^ 2 to k 

— Choose Cj € yi = xf, ...,y„ = x^ with probability 

IS(q : C) 
Er=iIS(2/^:C) 

— Add selected seed to the initialization seed set: C -^ C U {cj}. 
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E Notations 



Exponential family: 

Pf{x; 6) = e(t(-)'0)-m+m 

X 

d 

D 



t{x) 
k{x) 
F 
VF 

p* 

Distribution parameterization: 



IN 

V 

M 

A 

L 
Pf{x; A) 

Pf{x;v) 

Mixture : 

m 

Ait 

H{w) 

H^ {p : q) 

Wi 

di 

Vi 
rh 
k 

n 

Clustering: 
X = {xi,... 

\X\,\C\ 
^1 ) • • • ) Zfi 

y = {yi 



t{xi),...,yn = t{Xn)} 



inner product (e.g., x'^y for vectors, ti(Y~^ X) for matrices) 

Exponential distribution parameterized using the ^-coordinate system 

support of the distribution family ({x | pf{x] 0) > 0}) 

dimension of the support X (univariate versus multivariate) 

dimension of the natural parameter space 

(uniparameter versus multiparameter) 

sufficient statistic {v = ^ ^27=1 *(^j)) 

auxiliary carrier term 

log-normalizer, log-Laplace, cumulant function (F : IN — )■ M) 

gradient of the log-normalizer (for moment r/-parameterization) 

Hessian of the log-normalizer 

(Fisher information matrix, SPD: V'^F[6) >- 0) 

Legendre convex conjugate 

canonical natural parameter 

natural parameter space 

canonical moment parameter 

moment parameter space 

usual parameter 

usual parameter space 

density or mass function using the usual A-parameterization 

density or mass function using the usual moment parameterization 

mixture model 

closed probability {d — l)-dimensional simplex 

Shannon entropy — X^j^^ WilogWi (with OlogO = by convention) 

Shannon cross-entropy —J2i=iP^'^S1 

mixture weights (positive such that J2i=i "^i — ^) 

mixture component natural parameters 

mixture component moment parameters 

estimated mixture 

number of mixture components 

mixture parameters 

sample (observation) set 

cardinality of sets: n for the observations, k for the cluster centers 

Hidden component labels 

sample sufficient statistic set 
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L(xi, 


...,x, 


o,fi,x 




WiJ 




i 




3 




C 




Cl,..., 


Ck 


ai,.... 


,Oik 


Bf 





Jf 

Evaluation criteria: 
If 

I'f 
If 

L'f 

knieansF 



likelihood function 

maximum likelihood estimates 

soft weight for Xi in cluster /component Cj {wj,9j) 

index on the sample set xi, ..., Xj, ..., x„ 

index on the mixture parameter set 9i, ...,9j, ...,6k 

cluster partition 

cluster centers 

cluster proportion size 

Bregman divergence with generator F: 

BF{e2,ei) = KL{pf{x : 9i) : pf{x : 62)) 
= Bf- {vi,V2) 

= F{e2) + F*{7]i)-{7]i,e2) 

Jensen diversity index: 

JF{pi,.:,Pn;Wi,...,Wn) = Yl=lWiF{pi) - F{YJl^^Wip.i) > 

average incomplete log-likelihood: 

[f(xi, ..., Xn) = }, Ya=1 log Ei=l WjPFixi] 9j) 

average complete log-likelihood 

l'p{xi,...,Xn) = ^YA=l'^OSWz,PF(.Xi;ezJ 

geometric average incomplete likelihood: 

Lf{xi, ..., x„) = e'^(^i'-'^") 

geometric average complete likelihood: 

L'^(xi, ...,x„) = e''F(^i'-'^") 

average /c-means loss function (average divergence to the closest center) 



1 " 
kmeansi?(A',C) = — > BF{xi : C) 

4 = 1 

1 '^ 



kmeansiT' , 



j=l xeCj 
k 

i=i 
= Jf{X)-Jf{C) 
average /c-means loss function with respect to additive Bregman divergences 
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